SQRL ACORN GPU ACCELERATOR GUIDE
I have condensed some of the nitty-gritty technical nuggets that GPUHoarder gave us during his recent stream of consciousness on the FPGA Discord server. I have taken great liberty in editing and extrapolating in an attempt to add clarity, so this may contain errors or misunderstandings on my part. I am happy to make corrections and additions as needed.

What I am envisioning is that this could be the start of a comprehensive FPGA reference guide put together by the community, for the community, and organized in a more accessible way than Discord allows. With the help of many knowledgeable community members as editors of this wiki, this could be the start of something big. We could also host this on a different platform to avoid all of the annoying ads...

~ BryGuy#8913

ACORN/GPU COMMUNICATION PATTERNS

The CPU/GPU driver allocates a memory buffer on the GPU, and that buffer has an actual address range on the PCIe bus. The code sends this address range to the Acorn, and the Acorn then writes hash state data to that range for that GPU. The PCIe switch knows the destination and routes the traffic accordingly, without it needing to go through the CPU.

Sometimes (not for all algos) the GPU puts data in a buffer for the Acorn. The Acorn can read this data, but since writes are more efficient than reads on PCIe, the GPU will instead write the data through CPU memory and ultimately on to the Acorn. If this is not feasible, the Acorn reads the data directly from the GPU. Simultaneous transmission in both directions (full duplex) incurs some penalty due to completions on the bus, but it isn’t needed, and we try to avoid it where possible.
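The routing preferences described above can be written down as a tiny decision sketch. This is only a toy model of the prose, not SQRL driver code: the `Route` labels, the function names, and the `host_write_feasible` flag are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Route(Enum):
    # Labels are illustrative only, not SQRL terminology.
    ACORN_WRITES_TO_GPU = "Acorn writes into the GPU buffer via the PCIe switch"
    GPU_WRITES_VIA_HOST = "GPU writes through CPU memory, then on to the Acorn"
    ACORN_READS_FROM_GPU = "Acorn reads directly from the GPU buffer"


def acorn_to_gpu_route() -> Route:
    """Hash state goes out as a write to the GPU buffer's bus address range."""
    return Route.ACORN_WRITES_TO_GPU


def gpu_to_acorn_route(host_write_feasible: bool) -> Route:
    """Prefer a write path; fall back to an Acorn-initiated read otherwise."""
    if host_write_feasible:
        return Route.GPU_WRITES_VIA_HOST
    return Route.ACORN_READS_FROM_GPU


if __name__ == "__main__":
    print(acorn_to_gpu_route().value)
    print(gpu_to_acorn_route(host_write_feasible=True).value)
    print(gpu_to_acorn_route(host_write_feasible=False).value)
```

The point of the sketch is only that reads are the fallback, never the first choice, in either direction.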
Any given algorithm can involve the following sequences of data transmission:

* Acorn to GPU (AG)
* GPU to Acorn (GA)
* Acorn to GPU and back to Acorn (AGA)
* GPU to Acorn and back to GPU (GAG)

The Acorn can handle multiple internal algorithms, but whereas the GPU is time-limited (x per second for each algorithm), the Acorn is space-limited. Internally, the Acorn can achieve much higher hashrates for pieces of an algorithm than the PCIe bus can handle, so the optimal bitstream design is about fitting as many pieces as possible on the chip at once.

Example 1: Lyra2Rev2

Only the Lyra portion of the algo lives on the GPU. The communication sequence in this case is therefore from Acorn to GPU and then back to Acorn (AGA).

Example 2: X16R

6 of the 16 individual algorithms that are chained to make X16R can be efficiently packed onto a single Acorn. Since the order of the 16 algorithms in the chain depends on the hash result of the previous block, and this whole chunk spins around at 30 megahashes per second, the data will be used with varying degrees of efficiency from block to block. This means that the optimal sequence of communication will vary: data might flow from Acorn to GPU and then back to Acorn (AGA), or the other way around (GAG), depending on the situation. If, for example, the average algorithm takes 5 nanoseconds to complete on the GPU, then with the Acorn performing the algorithms it is most efficient at, the average algorithm can be completed in just 68% of the time, or in this case 3.4 nanoseconds.

ACORN NEST X2G COMMUNICATION

There is no inherent performance bonus for the Acorn when communicating via the NEST X2G riser vs. any other PCIe connection with similar throughput. Performance all comes down to PCIe bandwidth, regardless of placement of the Acorn.
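To put rough numbers on "PCIe bandwidth", here is a minimal sketch that computes theoretical one-direction link rates from the published PCIe 2.0 and 3.0 per-lane signalling rates and line encodings. These are raw ceilings before packet overhead, so real throughput will be lower, and nothing here was measured on an Acorn or a NEST X2G. `GENERATIONS` and `link_gb_per_s` are names made up for this example.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per PCIe generation.
GENERATIONS = {
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
}


def link_gb_per_s(generation: str, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before packet overhead."""
    gt_per_s, efficiency = GENERATIONS[generation]
    return gt_per_s * efficiency * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    # Slot types that this guide mentions.
    for generation, lanes in [("2.0", 8), ("2.0", 16), ("3.0", 4), ("3.0", 8), ("3.0", 16)]:
        rate = link_gb_per_s(generation, lanes)
        print(f"PCIe {generation} x{lanes}: {rate:.2f} GB/s")
```

For reference, 16 lanes of PCIe 2.0 works out to 8 GB/s per direction under these assumptions.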
The NEST X2G isn’t magic, but it does offer an on-board switch and bandwidth that doesn’t have to go out to the chipset or CPU. It can also optimize a system by allowing the full potential of 16 lanes of PCIe 2.0 from any of the following motherboard slots, without the need for bifurcation support:

* PCIe 2.0 x8
* PCIe 2.0 x16
* PCIe 3.0 x8
* PCIe 3.0 x16
* PCIe 3.0 x4 (not ideal, slightly throttled vs. the other options above)

PCIE BANDWIDTH MANAGEMENT AND BOTTLENECKS WHEN USING ACORN

The Acorn nominally uses 2 GB/s, and that transfer speed will be spread across as few or as many GPUs as needed, based on your slots and how much of each Acorn’s potential each GPU effectively utilizes.

For any given algorithm, you’re limited by one of three things: GPU, Acorn, or available PCIe bandwidth. You are almost never limited by the Acorn, since it was designed and sized to always be able to make full use of all available PCIe bandwidth. Exceptions to this can be seen in the dual Acorn CLE-215+ chart, where the GPU can handle more than 2 GB/s of acceleration.

Oftentimes the cores on the Acorn are running as much as 50% higher than PCIe can handle, so theoretically the Acorn could be utilized in a solo/dual mining situation to an even greater degree than might be suggested by merely looking at Acorn utilization in the charts.

MORE INFORMATION

For more information about the Acorn and the entire SQRL team that brought it to life, visit http://squirrelsresearch.com/
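Returning to the bottleneck rule from the bandwidth section (GPU, Acorn, or available PCIe bandwidth), a minimal sketch of it is just "take the smallest of three limits". Expressing all three limits in the same unit (GB/s of accelerator traffic) is a simplification for illustration, the GPU figure in the demo is made up, and `effective_rate` is a hypothetical helper, not part of any SQRL tool.

```python
from typing import Tuple


def effective_rate(gpu_gb_s: float, acorn_gb_s: float, pcie_gb_s: float) -> Tuple[str, float]:
    """Return the limiting component and the resulting rate in GB/s."""
    limits = {"GPU": gpu_gb_s, "Acorn": acorn_gb_s, "PCIe": pcie_gb_s}
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    pcie = 2.0          # nominal Acorn figure quoted in this guide
    acorn = 1.5 * pcie  # cores running "as much as 50% higher" than PCIe can handle
    gpu = 5.0           # hypothetical GPU appetite, for illustration only
    print(effective_rate(gpu_gb_s=gpu, acorn_gb_s=acorn, pcie_gb_s=pcie))
```

With the Acorn cores sized above the link rate, the Acorn itself only becomes the limit in this model if both the GPU and the PCIe path can take more than it produces.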